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ABSTRACT 

We suggest a novel method of clustering and exploratory 
analysis of temporal event sequences data (also known as 
categorical time series) based on three-dimensional data grid 
models. A data set of temporal event sequences can be rep¬ 
resented as a data set of three-dimensional points, each point 
is defined by three variables: a sequence identifier, a time 
value and an event value. Instantiating data grid models to 
the 3D-points turns the problem into 3D-coclustering. 

The sequences are partitioned into clusters, the time vari¬ 
able is discretized into intervals and the events are parti¬ 
tioned into clusters. The cross-product of the univariate 
partitions forms a multivariate partition of the represen¬ 
tation space, i.e., a grid of cells and it also represents a 
nonparametric estimator of the joint distribution of the se¬ 
quences, time and events dimensions. Thus, the sequences 
are grouped together because they have similar joint distri¬ 
bution of time and events, i.e., similar distribution of events 
along the time dimension. The best data grid is computed 
using a parameter-free Bayesian model selection approach. 
We also suggest several criteria for exploiting the resulting 
grid through agglomerative hierarchies, for interpreting the 
clusters of sequences and characterizing their components 
through insightful visualizations. Extensive experiments on 
both synthetic and real-world data sets demonstrate that 
data grid models are efficient, effective and discover mean¬ 
ingful underlying patterns of categorical time series data. 
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1. INTRODUCTION 

Mining data with temporal information is a key challenge in 
the Knowledge Discovery process. Temporal data is complex 
given that an object is described by one or more sequences of 
time-ordered elements or events. Depending on the nature of 
the temporal events (categorical or numerical, time-points or 
time intervals), classical data mining techniques like pattern 
mining, clustering, classification have been instantiated for 
temporal data [14]. 

Here, we focus on categorical times series (cats) data (i.e., 
time-points event sequences data), where each event of a se¬ 
quence is annotated by a time value t. Mining cats data is 
useful in many application domains, e.g., [16] explore Elec¬ 
tronic Medical Records data to find frequent temporal pat¬ 
tern of ICD codes across patients; [11] look for frequent 
user behaviors in unexpected time periods from web logs; 
in social science domain, [15] group individuals with simi¬ 
lar life courses. In the literature, a lot of the efforts have 
been dedicated to pattern mining in cats data, (e.g., fre¬ 
quent temporally-annotated sequence mining in [3]) whereas 
summarizing through clustering such data has received less 
attention (see further related work discussed in section 5). 
Indeed, most of the clustering techniques for sequential data 
are dedicated to sequences without time annotations, i.e., 
only the placement or the sequentiality of events is relevant 
like in biological data, one of the most popular applications 
of sequence clustering. 

In this paper, we suggest a methodology for clustering and 
exploratory analysis of cats data. From a domain expert 
point of view, a clustering of cats data should hold the fol¬ 
lowing features: 

1. Global picture: the clustering technique should pro¬ 
pose a global picture/summary of the underlying data 
structure and show the evolution of clusters of cats 
along the time dimension. 

2. Local pattern detection: the clustering technique should 
also highlight local patterns, e.g., combination of groups 
of events and time segments that characterize a par¬ 
ticular cluster of cats w.r.t. the whole cats data. 

3. Balancing generality and accuracy: the resulting clus¬ 
tering should propose a valuable trade-off between gen¬ 
erality (i.e., a concise summary) and accuracy the de¬ 
scription of input data. 
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4. Parameter-free: To facilitate the task of the analyst, 
computing the clustering should not involve parameter 
tuning. 

5. Exploration abilities: The whole methodology should 
take into account the expert needs and allow him to 
explore the resulting clustering w.r.t. each data di¬ 
mension and at various granularities. 

To the best our knowledge, there exists no clustering tech¬ 
nique for cats having all these properties. The methodology 
we suggest fulfills all the above requirements. The original¬ 
ity of our approach is that the cats data clustering problem 
is seen as a three-dimensional co-clustering problem. The 
three dimensions (or variables) are sequence identifiers, time 
and event. 

Roadmap: In section 2, we suggest Khiops Co-clustering 
(khc) , a 3D co-clustering method for categorical time se¬ 
ries based on data grid models [1]. KHC aims at simultane¬ 
ously partitioning the sequence identifiers into clusters, dis¬ 
cretizing the time into intervals and partitioning the events 
into clusters by optimizing a Bayesian criterion that bets 
on a trade-off between the accuracy and robustness of the 
data grid model. The optimal grid is reached using an user 
parameter-free Bayesian selection method. In section 3, we 
show how to exploit the resulting grid at various granulari¬ 
ties by means of several criteria derived from the optimiza¬ 
tion criterion and information-theoretic measures. Section 4 
reports the experimental validation of our contributions on 
both synthetic and real-world data sets. We discussed fur¬ 
ther related work in section 5 before concluding. 

Example 1 (From cats to data grid models). Let 
us consider a toy example of cats data, made of 4 cats S = 
(Si, £ 2 , S 3 , £ 4 }, 4 events E = {A, B, C, D} and aT = [0; 100] 
timeline. Figure 1 shows the input cats data and the result¬ 
ing 3D-coclustering which we split into two 2D (T x E) co¬ 
clustering, following the two clusters of cats we found. 

Cats Si and S 2 are grouped together because they have sim¬ 
ilar joint distribution of time and events; in other words, 
they have similar distribution of events along the time di¬ 
mension. 

Cats S 3 and £4 also have similar event distribution along 
the time, but their time-event distribution is clearly differ¬ 
ent from the distribution of Si and £ 2 , therefore they belong 
to a different cluster. 

Two time segments are found, Ti = [0; 50] andT 2 =]50; 100], 
which correspond to the two different regimes of events: A, B 
on T! for S 1,2 then C,D on T 2 ; and the opposite behavior 
for £3,4. 

2. CATEGORICAL TIME SERIES AND DATA 
GRID MODELS 

Context and notations. A temporal event sequence Si 
(or a categorical time series, say cats) of length ki > 0 is a 
time-ordered finite set of observations: 

Si = {(ti 1 , eq), ( ti 2 , e; 2 ),..., (ti ki , Ci ki )) 

such that V), 1 < j < iki,tj £ R+, ej £ E and £ is a non- 
ordered set of categorical events. A cats data set is simply a 
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Figure 1: Visualization of results of khc on a toy 
sample data of 4 cats {£1, £2, £3, £4}, using 4 events 
{A, B, C, D } over a [0; 100] timeline. The clustering 
highlight two clusters of cats: the first one composed 
of cats, with events A and B in time interval [0; 50] 
and with events C and D in time interval ]50; 100]; 
the second cluster shows an opposite behavior. 

set of such defined cats T> = {si,..., s n }. We represent D as 
a three-dimensional data set, i.e., with three variables (two 
categorical and one numerical variable): £ for the sequence 
(cats) id variable, T for the time variable and E for the event 
variable. In the following, an object (s, t, e) £ V is called a 
point of the data set and N is the total number of points in 
D. 

This general definition of cats data allows different size of 
cats and does not force the time stamps to be “aligned” for 
all the cats. 

2.1 Data grid models 

This 3D representation is suitable for co-clustering through 
data grid models [1]. To make the paper self-contained, 
we recall the main features of the generic data grid model 
approach and describe its instantiation to cats data. 

A data grid provides a piecewise constant joint density esti¬ 
mation of the input variables. Instantiating data grid models 
to the cats data, the goal is to simultaneously partition the 
categorical variables (sequence ids and events) into clusters 
and to discretize the numerical variable (time). The result is 
a 3D grid whose cells (say coclusters) are defined by a group 
of sequence ids, a group of events and a time interval. Notice 
that in all rigor, we are working only with partitions of vari¬ 
able value sets. However, to simplify the discussion we will 
sometime use a slightly incorrect formulation by mentioning 
a “partition of a variable” and a “partitioned variable”. 

In order to choose the “best” data grid model M* (given the 
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_ Table 1: Notations and definitions _ 

Defintions 

cats identifiers variable, time variable, event variable of data T> 
number of sequences 
number of events in E 
number of points in D 

number of clusters of sequences (resp. clusters of events, time intervals) 
the number of cells of the grid 

cumulated number of points of cluster is of sequences 

cumulated number of points in time interval jr 

cumulated number of points of cluster ig of events 

cumulated number of points of the grid cell (is, jr,i E )) 

the number of sequences in cluster is (resp. event values in cluster i E ) 

number of points in sequence i (resp. with event value i) 


data) from the model space M , we use a Bayesian Maximum 
A Posteriori (MAP) approach. We explore the model space 
while minimizing a Bayesian criterion, called cost. The cost 
criterion bets on a trade-off between the accuracy and the 
robustness of the model and is defined as follows: 

cost(M) = — log (p(M | D)) oc — log (p(M) x p(D \ M)) 

posterior prior likelihood 


We now define the model space M which consists of a family 
of cats data co-clustering models, based on clusters of cats 
ids, time intervals, clusters of events and a multinomial dis¬ 
tribution of all the points on the cells of the resulting data 
grid. 


Definition 1 (Cats data grid models). A cats data 
grid coclustering model is defined by: 


• a number of clusters of cats ids, 

• a number of intervals for the time variable, 

• a number of clusters of events, 

• the repartition of the cats ids into the clusters of cats, 

• the repartition of the events into the clusters of events, 


Definition 2 (cost: data grid evaluation). A data 
grid model for cats co-clustering is optimal if the value of the 
following cost criterion is minimal: 


cost(M) = 

log n + log a + log N + log B(n, ks) + log B(a, k E ) 
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the distribution of the points of the cats data on the 
cells of the data grid, 


where B(n, ks) is the number of partitions ofn elements into 
ks subsets and B(a,k E ) is defined in a similar way. 


• for each cluster of cats (resp. of events), the distribu¬ 
tion of the points that belongs to the cluster on the cats 
(resp. events) of the cluster. 

Boulle [1] has shown that one can obtain an exact ana¬ 
lytic expression of the cost criterion if one consider a data- 
dependent hierarchical prior (on the parameters of a data 
grid model, see definition 1) that is uniform at each stage of 
the hierarchy. Notice that it does not mean that the prior 
is uniform, thus in our case, the MAP approach is different 
from a simple likelihood maximization. The cost criterion is 
then defined as follows: 


The first four lines stand for the a priori probability of the 
grid model and constitute the regularization term of the 
model: the first line corresponds to the a priori term for 
the choice of the number of clusters for S and E, the num¬ 
ber of intervals for T and the choice of partition of S and E 
into value groups. The second line represents the specifica¬ 
tion of the distribution of the N points on the k cells of the 
grid. The third line corresponds to the specification of the 
distribution of the points of each cluster of cats on the cats 
ids. The fourth line specifies the similar distribution for the 
events. 

The last four lines stand for the likelihood of data given 
the model: the fifth line corresponds the likelihood of the 



distribution of the points in the cells using a multinomial 
term. The sixth (resp. seventh) line is the likelihood of cats 
ids (resp. event values) locally to each cluster of cats (resp. 
events). The last line stands for the likelihood of ranks lo¬ 
cally to each time interval. 

The intuition behind the trade-off between the a priori (reg¬ 
ularization) terms and the likelihood terms is as follows: 
complex models (with many clusters of cats and/or events 
and/or many time intervals) are penalized whereas models 
that are closest to the data are preferred. The extreme case 
where we have at most one point per cell will maximize the 
likelihood but we will get a very low a priori probability 
of the grid model, thus a high cost value. The other side 
case, i.e., the null model, is when we have only one cell: 
we have high prior probability but very low likelihood, thus 
high cost value. Grids with low cost value indicate a high a 
posteriori probability p(M | D) and are those of interest be¬ 
cause they achieve a balanced trade-off between accuracy 
and generality. In terms of information theory, negative 
logarithm of probabilities can also be interpreted as code 
length: here, according to the Minimum Description Length 
principle (MDL), the cost criterion can be interpreted as 
the code length of the grid model plus the code length of 
the data given the grid model; and a low cost value also 
means a high compression of the data using grid model M. 

2.2 Data grid optimization 

Optimization algorithm. The optimization of data grid is 
a combinatorial problem: the number of possible partitions 
of n cats is equal to the Bell number B(n) = \ ibr ( we 

have a similar number for the event dimension E) and the 
number of discretizations of N values is 2 N . Obviously, an 
exhaustive search is unfeasible and as far as we know, there is 
no tractable optimal algorithm. Therefore the cost criterion 
is optimized using a greedy bottom-up strategy whose main 
principle is described in pseudo-code Algorithm 1. We start 
with the finest grained data grid, that is made of the finest 
possible univariate partitions (of S, T and E), i.e., based 
on single value intervals or clusters. Then, we evaluate all 
merges between clusters of sequence ids, clusters of events 
and adjacent time intervals and perform the best merge if 
the cost criterion decreases after the merge. We iterate until 
there is no more improvement of the cost criterion. 

A straightforward implementation of the greedy heuristic 
remains a hard problem since each evaluation of the cost 
criterion for a grid M requires 0(naN) time, given that the 
initial finest grid is made of up to n x a x N cells (where n 
is the number of cats ids, a the number of events (|-E|) and 
N the number of points in D). Furthermore, each step of 
algorithm 1 requires 0(n 2 ) (resp. 0(o 2 ), O(N)) evaluations 
of merges of clusters of cats ids (resp. clusters of events, time 
intervals); and there are at most 0 (n + a + N ) steps from 
the finest grained model to the null model. The overall time 
complexity is bounded by 0(naN(n 2 + a 2 + N)(n + a + N)). 
In [1], it has been shown that further optimizations allow to 
reduce the time complexity to 0(N-\/N log N). Advanced 
optimizations combined with sophisticated algorithmic data 
structures mainly exploits (i) the sparseness of the grid, (ii) 
the additivity property of the cost criterion and (in) starts 
from non-maximal grained grid models using pre and post¬ 
optimization heuristics: 


Algorithm 1: KHC: Cats data grid 
Input : M Initial data grid solution 

Output: M *, cost(M*) < cost{M ) final data grid solution 
with improved cost 

1 M* <r- M; 

2 while improved data grid solution do 

s M' <- AT; 

4 forall the Merge m between two clusters of S or 
E or two intervals of T do 

5 M+ <r- M* + m ; //consider merge m for grid 

M* 

e if COS t{M + ) < cost(M’) then 

7 [MV M+; 

8 if cost(M') < cost(M*) then 

9 |_ M* <— M' ; // Improved grid solution 

10 return M* 


(i) Cats data sets represented by 3D points are sparse. 
Among the 0(naN) cells of the grid, at most N cells 
are non-empty. The contribution of empty cells to the 
cost criterion in definition 2 is null, thus each evalu¬ 
ation of a data grid may be performed in O(N) time 
through advanced algorithmic data structures. 

(ii) The additivity of the cost criterion stems from the 
data-dependent hierarchical prior of criterion. It means 
that it can be split in a hierarchy of components of the 
grid model: the variables ( S , T and E), then the parts 
(clusters or intervals) and finally cells. The additivity 
property allows to evaluate all merges between inter¬ 
vals or clusters in 0(N ) time. Moreover, the sparseness 
of the data set ensures that the number of revaluations 
(after the best merge is performed) is small on average. 

(in) Instead of starting from the finest grained grid, for 
tractability concern, the algorithm starts from grids 
with at most 0(x/A) clusters or intervals. Dedicated 
preprocessing and postprocessing heuristics are em¬ 
ployed to locally improve the initial and final solu¬ 
tions produced by algorithm 1. In these heuristics, the 
cost criterion is post-optimized alternatively for each 
variable while the partitions of the others are fixed, 
by moving values across clusters and moving interval 
boundaries for the time variable. 

The optimized version of algorithm 1 is now time-efficient 
but may lead to a local optimum. To alleviate this con¬ 
cern, we use the Variable Neighborhood Search (VNS) meta¬ 
heuristic [4]. The main principle consists of multiple runs 
of the algorithms using various random initial solutions (we 
consider 10 rounds of initialization): it allows anytime op¬ 
timization - the more you optimize, the better the solu¬ 
tion - while not growing the overall time complexity of al¬ 
gorithm 1. Full details of the optimization techniques are 
available in [1]. 

3. EXPLOITING THE GRID 

In some real-world large-scale case studies, the optimal grid 
M* resulting from the optimization algorithm KHC is made 
of several hundreds of clusters of cats ids (or intervals and/or 



clusters of events), i.e. millions of cells, which is difficult to 
exploit and interpret. To alleviate this issue, we suggest 
a grid simplification method together with several criteria 
that allow us to choose the granularity of the grid for further 
analysis, to rank values in clusters and to gain insights in 
the data through meaningful visualizations. 

3.1 Data grid simplification 

Dissimilarity index and grid structure simplification. 

We suggest a simplification method of the grid structure that 
iteratively merge clusters or adjacent intervals - choosing the 
merge generating the least degradation of the grid quality. 
To this end, we introduce a dissimilarity index between clus¬ 
ters or intervals which characterize the impact of the merge 
on the cost criterion. 


Definition 4 (Typicality of a value in a cluster). 
For a value v in a cluster c of the partition X M of dimen¬ 
sion X given the grid model M, the typicality of v is defined 
as: 

t(v,c) = 
i x 
1 -p x m(c) a 

cjcx" Px M (cj)(cost(M\c \ v, Cj U v) — cost(M)) 

°3* c 

where Px M (c) is the probability of having a point with its 
value in cluster c, c\v is the cluster c from which we have 
removed value v, Cj U v is the cluster Cj to which we add 
value v and M\c\v, Cj U v the grid model M after the afore¬ 
mentioned modifications. 


Definition 3 (Dissimilarity index). Let c.i and c .2 
be two parts of a dimension partition of a grid model M 
(i.e. two clusters of sequence ids or events or two adjacent 
intervals). Let M ciUC2 be the grid after merging c.i and 
c .2 . The dissimilarity A(c.i,c. 2 ) between the two parts c.i 
and c .2 is defined as the difference of cost before and after 
the merge: 

A(c.i,c. 2 ) = cost(M c lUc . 2 ) — cost(M) 


When merging clusters that minimize A, we obtain the sub- 
optimal grid M' (with a coarser grain, i.e. simplified) with 
minimal cost degradation, thus with minimal information 
loss w.r.t. the grid M before merging. 

Performing the best merges w.r.t. A iteratively over the 
three partitioned variables without distinction, starting from 
M* until Mq, three agglomerative hierarchies are built and 
the end-user can stop at the chosen granularity that is nec¬ 
essary for the analysis while controlling either the number 
of clusters/cells or the information ratio kept in the model. 
The information ratio of the grid M' is defined as follows: 

IR(M') = {cost(M') — cost(M^)) / (cost(M*) — cost(M $)) 


where M@ is the null model (the grid where no dimension is 
partitioned). 

Building the hierarchies from M* to Mg for the partitioned 
variables S, T and E shows a quadratic time complexity 
w.r.t. the total number of parts of the partitioned vari¬ 
ables of M*. However, generally, khc has already done the 
hard work: the number of parts is small. In practice, the 
computational time for building the hierarchies is negligible 
compared with the optimization phase. 


3.2 Ranking cats and events 

Typicality for ranking categorical values in a cluster. 

When the chosen granularity is reached through agglomer¬ 
ative hierarchy, the number of clusters per categorical di¬ 
mension (cats ids or events) decreases and mechanically the 
number of values per cluster increases, ft could be useful to 
focus on the most representative values (cats ids or events) 
among thousands of values of a cluster. In order to rank 
values in a cluster, we define the typicality of a value as 
follows: 


Intuitively, the typicality evaluates the average impact in 
terms of cost on the grid model quality of removing a value 
v from its cluster c and reassigning it to another cluster 
Cj c. Thus, a value v is representative (say typical) of a 
cluster c if v is “close” to c and “different in average” from 
other clusters Cj ^ c. 


3.3 Insightful visualizations 

Insightful visualizations with Mutual Information and 
Contrast. It is common to visualize 2D coclustering results 
using 2D frequency matrix or heat map. For 3D coclus¬ 
tering, it is useful to select a dimension of interest (in our 
case, sequence ids S) and then we are able to visualize the 
frequency matrix of the two other dimensions (T and E) 
given a cluster c of S. We also suggest two other insightful 
measures for coclusters to be visualized, namely, the Con¬ 
tribution to Mutual Information (CMI) and the Contrast 
- providing additional valuable visual information inaccessi¬ 
ble with only frequency representation. Notice that the con¬ 
tributed visualizations are also valid whatever the dimension 
of interest. 


Definition 5 (Mutual Information and Contribution). 
For a cluster of cats ids Ci s , the mutual information between 
two partitioned variables T nM and E nM (from the partition 
7 tm of time and event variables induced by the grid model 
M) is defined as: 


i 1 =k'T 12 = k E 

MI(T 7rM -E nM ) = J2 MI ni? 

* 1=1 * 2=1 


where MU 1 i 2 


p(ci ii 2 )log 


P{Ciyi 2 ) 

P(dl)p(Ci 2 ) 


where MIi 1 i 2 represent the contribution of cell Ci 1 i 2 to the 
mutual information. 


Thus, if M/q; 2 > 0 then p(cqq) > p(cq)p(cq) and we ob¬ 
serve an excess of interaction between cq and a 2 located in 
cell Ci 1 i 2 defined by time interval Tq and group of events 
Ei 2 . Conversely, if M/qq < 0, then p(cqq) < p(cq)p(cq), 
and we observe a deficit of interactions in cell cqq. Fi¬ 
nally, if M/qq = 0, then either p(cqq) = 0 in which case 
the contribution to MI is null and there is no interaction 
or p(ci 1 i 2 ) = p(cq)p(cq) and the quantity of interactions in 



Cj, i 2 is that expected in case of independence between the 
partitioned variables. 


Definition 6 (Contrast). The contrast between the 
two partitioned variables to be visualized (T nM , E WM ) con¬ 
sidered jointly and S 7r * f is defined as: 


is = k S il=k T i 2 = k E 

Contrast^,E™),S* M )^ ^ ^ £ MI isili2 

*5 = 1 *1 = 1 *2 = 1 


where MI isili2 


p{c i: 


(log 


P( c »sn»2 ) 

p(ci li2 )p(c is ) 


Table 2: Two synthetic patterns: definition. 

Mi 

t £ T" 1 = [0; 250] => e € Sf 1 = {a, b, c} 

t £ T 2 Ml =]250; 500] E™ : L = {d, e, /} 

t £ T 3 Ml =]500; 750] => e £ E ™ 1 = {g , h, i} 
t £ T 4 Ml =]750; 1000] => e £ 1 = {j, k, 1} 


M 2 

t £ T" 2 = [0; 100] ^e£ E ? 2 = {j, k, 1} 
t £ T 2 M2 =4100; 400] => e £ E f 2 = {p, ft, *} 
t £ T-f 2 =]400; 600] => e £ Ef 2 = {d, e, /} 
t £ T 4 M2 =]600; 1000] => e £ 2 = {a, 6, c} 


Again, the sign of MIi s i 1 i 2 values explains what is contrast¬ 
ing in a s w.r.t. the set of all sequence ids from the view 
(T nM , E nM ). Positive and negative values both highlight 
the cells that characterize a s w.r.t. the set of all sequence 
ids; the former says that an excess of interaction is located 
in cell Ci 1 i 2 i s , the latter highlights a deficit of interaction (a 
negative contrast); and M/; 1 i 2 i s = 0 (or nearly) indicates 
no significant contrast. 


random value t (oil T) and an event value e generated ac¬ 
cording to the pattern Mi related to the cats id, i.e. an 
event value randomly chosen in the set Mfit) (see Table 2). 
Furthermore, we consider several noisy versions of this data 
set at various noise level r/ = {0.1, 0.2, 0.3,0.4, 0.5}: when 
generating a point, the probability that the event value ful¬ 
fills the pattern Mi definition is p(e £ Mi(t)) = 1 — r) and 
p(e £ {E\Mi(t)}) = rj. 


While the visualization of CMI of the cells highlight valuable 
information that is local to a cluster of cats, the contrast is 
a global scope visualization. Both CMI and contrast bring 
complementary insights to exploit the summary provided by 
the grid. In our experiments, we show the added-value of 
those visualizations on both synthetic and real data sets. 

4. EXPERIMENTAL VALIDATION 

Our grid-based co-clustering method khc and visualization 
tools are both available under the name khiops at http: 
//www . khiops. com. In this section, to validate our contribu¬ 
tions, we report the experimentations on both synthetic and 
real-world large-scale dblp data sets. These experiments are 
designed to answer the following questions: 


1. Effectiveness: How successful is KHC in co-clustering 
cats, i.e., finding meaningful clusters of cats ids and 
events and intervals of time ? 

2. Efficiency / Sacalability: Considering computational 
time, how does khc scale w.r.t. the data size and char¬ 
acteristics (i.e., the number of points, cats ids, events, 
underlying pattern to be discovered and noise) ? 

3. Knowledge and insights: What kind of insights do the 
resulting grid and the exploitation tools bring in our 
knowledge of the data ? 

4.1 Synthetic data sets 

Let us consider two patterns Mi and M 2 defined on the 
time domain T = [0; 1000] C R + and the set of events E = 
{a, 6,..., k, 1} such that table 2 defines: 
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Let us consider 10 cats following pattern Mi and 10 cats for 
pattern M 2 (we also did the experiments for CM =50 and 
100 cats per pattern). We generate a data set D of N = 2 20 
points (i.e., 5.10 4 points in average per cats). Each point 
is a triplet with a randomly chosen cats id (among 20), a 


Figure 2: Evolution of ARI for synthetic two-pattern 
cats data sets, for CM = 10, 50 and 100 cats per 
pattern and at various levels of noise w.r.t. number 
of points N. 



















































We apply khc to subsets of D of increasing sizes: N varying 
from 2 1 to 2 20 , i.e., overall KHC is experimented through 360 
various synthetic data sets. We compute the Adjusted Rand 
Index (ARI) for each grid generated by KHC to evaluate the 
agreement between cats id clusters of the grid and the two 
underlying patterns. 

The results are reported in Figure 2. We observe that for 
small subsets of D, there is not enough points for KHC to 
discover significant patterns: no cluster of cats id is found 
for N < 64 (i.e., in average 3 points per cats). For CM = 10 
(10 cats per pattern in figure 2(a)), beyond N = 128 points 
(6 points per cats in average), we have ARI = 1 and the 
two underlying patterns are discovered. We see also that 
at noise level r/ < 0.1 , N = 128 points are enough to find 
the patterns; then, more noise implies that more points are 
necessary to discover the patterns. Finally, increasing the 
number of points up to 2 20 does not lead to over-fitting, 
ARI = 1 and is stable. The same observations hold for 
CM = 50 and CM — 100: when the number of cats per 
pattern increases more points are needed. 

Concerning the other variables (time T and event E), gen¬ 
erally speaking, when considering the increasing number of 
points, the true segmentation of time is discovered at the 
same step (or just before) and the true clustering of events 
is discovered first, i.e., just before the clustering of cats - 
both remaining stable with increasing number of points in 
the data. 

Running time. Figure 3 reports running time of KHC on 
various versions of two-pattern data sets for CM = 10, 50,100 
w.r.t. the number of points N. As expected, running time 
increases with the number of points in D but also with CM 
and 77 . For the most “difficult” data set, i.e., N = 2 20 , 
CM = 100 (5200 points per cats in average) and 77 = 0.5, 
KHC finds the two underlying patterns in about 90 minutes: 
the difficulty comes from the time dimension (potentially 2 20 
different values). 

A similar experiment has been led while considering integer 
values of time (T = [0; 1000] C N + ); in that case, KHC finds 
the patterns faster, in 13 minutes. We have led other similar 
experiments with 5 and 10 underlying patterns to be discov¬ 
ered. The main result is that more cluster patterns require 
more computational time and more points to be detected. 

Visualization and characterization of clusters. Let us 

consider the 3D-grid obtained by KHC on the two-pattern 
data set with N = 2 20 points, CM = 10 and 77 = 0.5. Fig¬ 
ure 4 shows 2D-views (T x E) of each cluster of cats found 
by KHC. Frequency, CMI and contrast visualizations bring 
different insights in the data and valuable information on 
the found clusters. 

In figures 4Mi(a) and M 2 (a), for the frequency visualiza¬ 
tion (i.e., the number of points Ni s j T i E for a cell isjriE), 
we already perceive the underlying patterns Mi and M 2 ; 
however, the noise level 77 = 0.5 degrades the visibility of 
patterns. 

Figures 4Mi(b) and M 2 (b) bring to light and characterize 
cluster patterns Mi and M 2 . Red cells relate positive CMI, 
i.e., excess of interactions between T and E in a cell con¬ 
ditionally to current cats ids cluster - that characterizes 
pattern cluster. Light blue cells stand for negative MR 1 i 2 
values, i.e. a slight deficit of interactions corresponding to 
noisy cells that are not significant to definition of the pat- 
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Figure 3: Running time of KHC w.r.t. number of 
points ( N ), number of cats per pattern ( CM ) and 
noise level ( 77 ). 


tern cluster. 

Finally, figures 4Mi(c) and M 2 (c) show the contrasting cells 
of each cats ids cluster: red for positive contrast, blue for 
negative contrast and white for no contrast. For example, let 
consider cluster pattern Mi: despite the noise level, white 
cell ([0; 100], { g , h, i}) is not characteristic of Mi since prob¬ 
ability of event group { g , h, i} in time interval [ 0 ; 100 ] is not 
different from Mi to M 2 . Also, white cell ([401; 500], {d, e, /}) 
shows null contrast since it is common to the definition of 
the two underlying patterns (there is a similar distribution 
of points in the cell due to our data generator). Finally, 
for Mi, red cells an excess of interactions whereas blue cells 
indicates negative contrast (a deficit of interactions) - that 
gives the mirror effect between figures 4Mi(c) and M 2 (c). 

4.2 Big pictures from DBLP bibliography 

In this section, we report the results of an exploratory anal¬ 
ysis of the DBLP data set using our contributions. The 
DBLP Computer Science Bibliography [10] records millions 
of publications (mainly from journals and conference pro¬ 
ceedings) of Computer Science authors since 1936. 

Let us consider a cats-like view of DBLP as a 3D point data 
set: Author x Year x Event, i.e. we consider author’s se¬ 
quence of publications over the years as cats data: Author is 
the cats id, Event is the name of the journal/proceedings/other 
where an author has published and Year is the year he has 
published in the current event. Duplicated points indicate 
that an author has published more than once in an event 
the same year (like e.g., (G. Alonso, 2009, EDBT) appearing 
twice in the data). In this form, DBLP data (downloaded 
in august 2013; 2013 was still incomplete and 2014 refer¬ 
encing just began) contains more than 6.352 million points 
- described by more than 1.297 million authors who have 
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Figure 4 : Tx E 2D-view: visualization of Frequency, Contribution to Mutual Information (CMI) and contrast 
for the two clusters of cats (corresponding to the two underlying patterns Mi and M 2 ) found by KHC. 


published in 6767 events from 1936 to 2014. 

The DBLP cats data shows skewed marginal distributions 
(see hgure 5): (a) 80% of authors have published less than 
5 times; (b) Most of the points come from the last 20 years; 
(c) half the events appear less than 200 times and 80% less 
than 1000 times. 



Figure 5: Basics on marginal statistics for DBLP 
cats data, (a) empirical cumulative distribution of 
authors that have published X times; (b) distribu¬ 
tion of points over the years; (c) empirical cumu¬ 
lative distribution of points with event value e £ E 
that appears X times. 

Big Picture. To conhrm the scalability and robustness of 
our approach and to obtain a global picture of DBLP cats 
data, we apply KHC on the whole data - and to the best of 
our knowledge, it has never been done. A first grid solution 
is obtained after 12 hours. Several rounds of optimization 
allow 12%) improvement of the cost criterion and KHC ends 
after 19 days. Figure 6 relates the evolution of cost improve¬ 
ment (compared with the first output grid solution which is 
already a “good” solution) w.r.t. computational time. We 
observe that most of the improvement is achieved in the first 
three days, then the improvement is saturated. The anytime 
facet of KHC allows us to stop before the completion of all 


rounds of optimization and more generally, it allows the an¬ 
alyst to set the amount of time devoted to the mining phase. 



Figure 6: Evolution of cost improvement (compared 
with the first output grid) w.r.t. running time on 
the whole DBLP cats data set. 

Notice that we have also run KHC on smaller versions of 
DBLP cats data, e.g., with only authors having published 
5 times or more and with only events appearing more than 
40 times in the data: this version of DBLP is made of 240 
thousands of authors and 5K events over the same timeline. 
Obviously, computational time is smaller (first grid obtained 
in 7 hours, rounds of optimization offer 8% of improvement 
and end in 8 days). Similar high-level results described be¬ 
low can be obtained from both experiments since removed 
data correspond to the least frequent events and authors 
(also the least typical) that have the smallest impact on the 
global data grid structure. 

We think it is worth waiting several days for computation 

















































since DBLP cats data is year-scale, thus potential update of 
such analysis is needed once a year. Using the whole DBLP 
cats data set, the final grid M* is made of 267 clusters of au¬ 
thors, 4 time intervals and 565 clusters of events (i.e., 6 x 10 5 
cells) whereas the finest grid is made of about 6.8 x 10 11 cells. 

Using dissimilarity index A and information rate IR, from 
M* to Mg, we build a hierarchy of clusters or intervals for 
each dimension. Figure 8 relates the whole hierarchy for 
the event dimension and shows how the events (proceed¬ 
ings/journals/others) are organized by topic in DBLP from 
cats data point of view. Since 565 clusters are hardly in¬ 
terpretable by humans, figure 8 highlights the hierarchy at 
HierarchicalLevel = 1 — IR(M) = 0.56, i.e., keeping 44% 
information from M*. 

At this granularity, 21 clusters of events can be interpreted 
and labeled easily by looking at the most typical event ti¬ 
tles of the clusters. Indeed, the found clusters correspond to 
Computer Science research sub-fields indexed by DBLP. For 
example, SIGMETRICS, SIGCOMM, INFOCOMM, GLOBE- 
COM, IMC, CoNEXT, IEEE Communications Magazine, 

... are among the most typical events of terminal clusters un¬ 
der the branch 11 (labeled Networks&Communications) be¬ 
cause recurrent researchers in that field regularly published 
in these events over the years and not significantly in other 
fields like e.g., 19: Robotics. Notice also the singularity 
of cluster 8 which mainly consists of references of Comput¬ 
ing Research Repository (CoRR) covering many sub-fields 
of Computer Science research. 



20 clusters of authors 


Figure 7: CMI visualization for Ax E for the whole 
time period. 

At this scale, the grid is made of 21 clusters of events, 4 
time intervals and 20 clusters of authors. Considering the 
2D visualization of CMI for Author and Event dimensions 
(see figure 7), red cells (i.e., positive CMI relates significant 
positive interactions) highlight the diagonal 2D-cells of the 
Ax E matrix with different color intensity depending on the 
time interval or the whole period considered, while the other 
light blue cells indicate a deficit of interactions. The diago¬ 
nal form of interactions between authors and events (actu¬ 
ally almost diagonal due to the singular cluster 8 including 
CoRR) indicates that, at this scale, most of the researchers 
(grouped in clusters) are active exclusively in a unique sub¬ 


field. 

Another interesting observation may be made when consid¬ 
ering the two agglomerate clusters (clusters 1-12 and 13-21 
from figure 8) at the top the hierarchy: the former mainly 
relates to fundamental research while the latter is more fo¬ 
cused on applied research. 

Zoom in the Data Bases and Data Mining fields. Fig¬ 
ure 9 details the sub-hierarchy of cluster 2 (AI, ML, Agents, 
DB, DM) from figure 8. For the same reason as above, and 
even if there exist authors across several sub-fields of clus¬ 
ter 2, Data Bases, Data Mining, Machine Learning, Artifi¬ 
cial Intelligence and Agents Systems are recognized as close 
though different sub-fields - confirming the intuition; par¬ 
ticularly, Data Mining is closer to Data Bases than AI, ML 
or Agents. 

In figure 10, we present two terminal clusters of authors who 
are involved in DB/DM research. Since clusters of authors 
may contain thousands of authors (as shown by figure 5, 
most of them have published less than 5 times) and fre¬ 
quency visualization is not enough to characterize the clus¬ 
ters, we show the 15 most typical authors and what is con¬ 
trasting in their trajectory of publications over the years 
w.r.t. the rest of the data. 

We observe that cluster (a) is made of senior DB researchers 
(H. Garcia-Molina, D. Weikum, D. Agrawal, ...) who are 
characterized by their activity in DB events in the whole 
time period with a strong contrast before 2004 in the top 
DB events. Cluster (b) whose most typical authors are e.g., 
J. Xu Yu, W. Lehner, ... is made of experienced but younger 
DB researchers: the contrast highlight their activity in DB 
field in the last ten years. 

We have also found clusters of authors who are characterized 
(by contrast) by their activity in (core) Data Mining research 
(resp. Semantic Web): typical authors of those clusters are 
well-known experienced researchers like P.S. Yu, J. Han, C. 
Faloutsos, ... (resp. I. Horrocks, S. Staab, W. Neijdl, ...) 
who cover their respective field since its birth. Similar ob¬ 
servations can be made for any other sub-fields discovered 
by the hierarchy of event clusters in figure 8. 

Thus, starting from the big picture provided by the 3D-grid 
computed by khc on the whole DBLP cats data, we are 
able to zoom in discovered sub-fields of Computer Science 
research indexed by DBLP, to obtain the most typical au¬ 
thors of clusters of authors that are involved in the field and 
to explain the characteristics of their sequence of publica¬ 
tions in terms of contrast. 

5. BACKGROUND AND RELATED WORK 

Since most standard numerical time series clustering algo¬ 
rithms are based on (dis) similarity measures, several (dis) 
similarity measures have been designed and exploited for 
categorical time series clustering: e.g., the Levenshtein dis¬ 
tance for clustering life courses [15], the discrete Frechet dis¬ 
tance for clustering migration data [17], the Compression- 
based Distance Measure (CDM) [6]... These measures are 
adapted to classical clustering and partitioning algorithms; 
they often require parameter tuning and do not directly of¬ 
fer abilities to interpret and explore the results for time and 
event variable. 

Coclustering methods may be classified into two different 




Figure 8: Hierarchy of events from DBLP data set. Colored sub-hierarchy at HierarchicalLevel = 
1 — IR(M) = 0.56, where 21 clusters of events may be labeled easily. 1: Software Engineering; 2: 
AI, ML, Agents, DB, DM; 3: Educat’IComput, ComputerHumanlnteract, Visuallnterface; 4: Compu- 
tat’Linguistics, Info&TextRetrieval, NLP, Speech-AudioRecog&Process; 5: FormalMethods, LogicCom- 
put&Prog; 6: Security, Cryptography; 7: DiscreteMaths, Algo, CS&InformTheory; 8: CoRR&InformTheory; 
9: Real-TimeSystems, Parallel-Distrib-GridComputing; 10: GeneralComput, Systems&Software; 11: Net- 
works&Communications; 12: Systlntegration-Config-Design-Architecture; 13: Simulation, AppliedMaths, 
OperatResearch; 14: Medicallnfo, SignalProcess; 15: ComputBiology, Bioinfo; 16: DiverseAppliedCS-1; 
17: DiverseAppliedCS-2; 18: AppliedAI&NeuralNetworks&FuzzySystems; 19: Robotics; 20: CompGraphics, 
InfoVisualization; 21: ComputerVision, PatternRec, Multimedia, ImageProcess 
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Figure 9: Sub-hierarchy corresponding to close but different sub-fields: Machine Learning, Artificial Intelli¬ 
gence, Agents, Data Bases, Data Mining. (X-axis: HierarchicalLevel = 1 — IR(M), X-axis: most typical events 
of terminal clusters). 
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BengChin Ooi 
H.V. Jagadish 
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Figure 10: Visualization of two clusters of authors, 
their typical authors, and contrast in 2D T x E view. 
Positive contrast (red cells) highlights what is char¬ 
acteristic of the cluster. ( DBi and DMj correspond 
to terminal clusters highlight in figure 9. 


branches: 


• coclustering methods for object x attributes (see pio¬ 
neering work [5]) 

• coclustering methods for two or more attributes like 
Dhillon et al. [2], which is the most related to our work 


Dhillon et al. [2] have proposed an information-theoretic co¬ 
clustering approach for two discrete random variables: the 
loss in Mutual Information MI(X,Y) - MI(X nM ,Y nM ) 
is minimized to obtain a locally-optimal grid with a user- 
defined mandatory number of clusters for each dimension. 

The Information Bottleneck (IB) method [20] stems from an¬ 
other information-theoretic paradigm: Given the joint prob¬ 
ability P(X,Y), IB aims at grouping X into clusters T in 
order to both compress X and keep as much information as 
possible about Y. IB also minimizes a difference in Mutual 
Information: MI(T, X) — j3MI(T, Y), where /3 is a positive 
Lagrange multiplier. Wang et al. [21] build upon IB and 
suggest a coclustering method for two categorical variables. 
Extending IB for more than two categorical variables, Slonim 
et al. [18] have suggested the agglomerative multivariate 
information bottleneck that allows constructing several in¬ 
teracting systems of clusters simultaneously; the interac¬ 
tions among variables are specified using a Bayesian network 
structure. 

There also exist many research works that suggest solutions 
for the problem of segmentation of one event sequence: e.g., 


Kiernan & Terzi [7, 8] suggest a parameter-free method for 
building interpretable summaries (through segmentation) of 
an event sequence. 

Going beyond 2D matrices, recent significant progress has 
been done in multi-way tensor analysis [19, 9]. For instance, 
[13] suggest a method for mining time-stamped event se¬ 
quences and effective forecasting of future events, but does 
not answer to the problem of clustering. Also, [12] explore 
the problem of mining co-evolving time sequences but is 
mainly dedicated to numerical sequences. Both recent ap¬ 
proaches scale well with the time dimension, but it is not 
demonstrated that they are effective and efficient on data 
with many sequences. 

As far as we know, there is no method building upon above 
recent related work and suggesting an effective and efficient 
solution to large-scale clustering of categorical time series. 

6. CONCLUSION & DISCUSSION 

We have suggested a method for coclustering and exploratory 
analysis of categorical time series (or temporal event se¬ 
quences) based on three-dimensional data grid models. The 
sequence identifiers are grouped into clusters, the time di¬ 
mension is discretized into intervals and the events are also 
grouped into clusters - the whole forming a 3D-grid. The op¬ 
timal grid (the most probable a posteriori in Bayesian terms) 
is obtained with an user parameter-free procedure. To ex¬ 
ploit the resulting grid, we have suggested (i) a dissimilarity 
index between clusters to select the wanted granularity of the 
grid while controlling the information loss; (ii), a criterion, 
namely the typicality, to rank and identify representative 
values in a cluster; (in) two other criteria stemming from 
Mutual Information to characterize, interpret and visualize 
the found clusters. Our insightful findings have been illus¬ 
trated on both synthetic and real-world data sets. 

As future work, we plan to extend the approach to super¬ 
vised cats classification and event forecasting. Looking at 
another direction, due to the genericity of the approach, we 
plan to apply khc to other application domains and data 
sets: actually, we are currently working on the behavior 
analysis of customer trajectories, where a customer (cats) 
is defined by his actions on various communication channels 
(e.g., interactive voice system, after-sales service calls, ser¬ 
vices/products browsing, storefront shops visits, etc.) over 
various time periods. 
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